
Add simulation API observability coverage#594

Merged
anth-volk merged 3 commits into main from codex/add-observability on Jul 3, 2026
Conversation

@anth-volk

Fixes #593

Summary

  • Add policyengine-observability initialization and structured logging for simulation Modal gateway and worker paths.
  • Keep legacy Logfire enabled while marking it as a candidate for replacement by policyengine-observability.
  • Add granular runtime timing segments, nested segment_tree assertions, handled error recording, and direct Logfire legacy helper tests.
  • Update the simulation API lockfile to resolve policyengine-observability v1.3.0.

Validation

  • uv run pytest tests/test_simulation_output_builder.py tests/test_logfire_legacy.py tests/gateway/test_endpoints.py tests/test_modal_bundle_image.py tests/test_observability.py -q passed: 86 tests.
  • uv run ruff format --check fixtures/gateway/shared.py src/modal/app.py src/modal/budget_window_batch.py src/modal/budget_window_scheduler.py src/modal/gateway/app.py src/modal/gateway/auth.py src/modal/gateway/endpoints.py src/modal/gateway/errors.py src/modal/logfire_legacy.py src/modal/logging_redaction.py src/policyengine_api_simulation/main.py src/policyengine_api_simulation/observability.py src/policyengine_api_simulation/simulation_output_builder.py src/policyengine_api_simulation/simulation_runtime.py tests/gateway/test_endpoints.py tests/gateway/test_errors.py tests/test_logfire_legacy.py tests/test_modal_bundle_image.py tests/test_observability.py tests/test_simulation_output_builder.py passed.
  • git diff --check passed.
  • uv run ruff format --check . still reports pre-existing formatting drift outside this branch in tests/test_dataset_uri.py, tests/test_hf_dataset.py, and tests/test_standalone_simulation_contract.py.

@anth-volk

Temporarily holding this PR until #597 is resolved.

anth-volk and others added 3 commits July 3, 2026 17:39
- errors.py: always emit the stdlib log line with full exception detail;
  record_error/record_event never raise and silently no-op on a disabled
  runtime, so the previous except-fallback was dead code and a correlation
  id could point at nothing server-side.
- logfire_legacy.py: track whether configure_logfire actually ran with a
  token and expose logfire_is_configured(); logfire's own send_to_logfire
  flag defaults to True on an unconfigured instance, so auth.py's
  audit-event gate and errors.py's legacy export gate never worked.
- endpoints.py: stop recording 500/404 request errors on the budget-window
  poll degraded path, which returns a successful 202 seed response; emit a
  degradation event plus stdlib warning instead of false
  http_request_failed events and mislabeled error metrics.
- observability.py + modal/app.py: worker operations now carry the Modal
  identity attributes (platform, runtime_role, modal_environment,
  modal_app_name, modal_function_name) via process_static_attributes();
  plain-process runtimes have no FastAPI-adapter injection point, so the
  attributes never reached worker operation logs.
- budget_window_scheduler.py: replace per-iteration child-poll and
  backoff-sleep segments with bounded aggregate attributes; each segment
  appends a node to the operation's in-memory segment tree, so a
  near-timeout batch grew memory without bound and emitted an operation
  log line large enough to be truncated.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
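
As an illustration of the budget_window_scheduler.py change above, here is a minimal sketch of replacing per-iteration segments with a fixed-size aggregate. It is not the code from this PR: `PollAggregate`, `poll_until_done`, and the attribute names are hypothetical, and the real scheduler attaches its attributes through policyengine-observability rather than a plain dict. The point is only that the summary stays the same size no matter how many times the loop runs.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class PollAggregate:
    """Fixed-size summary of a polling loop (hypothetical helper)."""

    poll_count: int = 0
    poll_seconds_total: float = 0.0
    poll_seconds_max: float = 0.0
    sleep_seconds_total: float = 0.0

    def record_poll(self, seconds: float) -> None:
        self.poll_count += 1
        self.poll_seconds_total += seconds
        self.poll_seconds_max = max(self.poll_seconds_max, seconds)

    def record_sleep(self, seconds: float) -> None:
        self.sleep_seconds_total += seconds

    def as_attributes(self) -> dict[str, float]:
        # Always four keys, regardless of how many iterations ran.
        return {
            "poll_count": self.poll_count,
            "poll_seconds_total": self.poll_seconds_total,
            "poll_seconds_max": self.poll_seconds_max,
            "sleep_seconds_total": self.sleep_seconds_total,
        }


def poll_until_done(
    is_done: Callable[[], bool],
    max_polls: int,
    backoff_seconds: float = 0.0,
    clock: Callable[[], float] = time.perf_counter,
    sleep: Callable[[float], None] = time.sleep,
) -> PollAggregate:
    """Poll until is_done() returns True, updating counters instead of
    appending one timing node per iteration."""
    aggregate = PollAggregate()
    for _ in range(max_polls):
        started = clock()
        done = is_done()
        aggregate.record_poll(clock() - started)
        if done:
            break
        sleep(backoff_seconds)
        aggregate.record_sleep(backoff_seconds)
    return aggregate
```

A batch that polls thousands of times before timing out still produces the same four attributes, so neither the in-memory tree nor the final log line grows with the iteration count.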
Modal's Function.from_name / Dict.from_name / FunctionCall.from_id return
lazy handles: constructing them only runs a local name check, and the
control-plane RPC fires on first use. The MODAL_DICT_READ and
MODAL_FUNCTION_LOOKUP segments therefore measured ~0ms while the real
network cost ran unattributed.

- list_versions / get_country_versions: move the dict(...) iterations —
  where hydration and the DictContents/DictGet RPCs happen — inside the
  MODAL_DICT_READ segments.
- submit endpoints: fold the lazy Function.from_name into the
  MODAL_FUNCTION_SPAWN segment, which is where its hydration RPC actually
  executes, and drop the MODAL_FUNCTION_LOOKUP segment name.
- budget window scheduler: drop the lookup segment around the lazy child
  Function handle; BUDGET_WINDOW_CHILD_SPAWN already times the spawn.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
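
The mis-attribution described in this commit can be reproduced without Modal. The sketch below is an assumption-laden stand-in: `LazyHandle`, `FakeClock`, `segment`, and `measure` are hypothetical names, and the simulated 0.25 s round trip is arbitrary. It only shows why a timing segment wrapped around the construction of a lazy handle records roughly zero, while the segment around first use absorbs the hydration cost.

```python
from contextlib import contextmanager


class FakeClock:
    """Deterministic clock so the example needs no real sleeping."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


class LazyHandle:
    """Stand-in for a lazily hydrated remote handle (not Modal's class)."""

    def __init__(self, name: str, clock: FakeClock, rpc_seconds: float) -> None:
        # Construction does no network work; it only stores the name.
        self.name = name
        self._clock = clock
        self._rpc_seconds = rpc_seconds
        self._hydrated = False

    def call(self) -> str:
        if not self._hydrated:
            # Simulated control-plane round trip, paid on first use only.
            self._clock.advance(self._rpc_seconds)
            self._hydrated = True
        return f"{self.name}: ok"


@contextmanager
def segment(timings: dict, name: str, clock: FakeClock):
    """Accumulate elapsed clock time for the wrapped block under `name`."""
    started = clock.now
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (clock.now - started)


def measure() -> dict:
    clock = FakeClock()
    timings: dict = {}
    with segment(timings, "lookup", clock):
        handle = LazyHandle("child", clock, rpc_seconds=0.25)
    with segment(timings, "spawn", clock):
        handle.call()
    return timings


print(measure())  # {'lookup': 0.0, 'spawn': 0.25}
```

The "lookup" segment carries no information here, which is the motivation for folding the handle construction into the segment where the first call happens.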
anth-volk force-pushed the codex/add-observability branch from 8df66db to 3b3abac on July 3, 2026 15:39
anth-volk marked this pull request as ready for review on July 3, 2026 16:09
anth-volk merged commit b9b734b into main on Jul 3, 2026
4 checks passed
anth-volk deleted the codex/add-observability branch on July 3, 2026 16:09

Development

Successfully merging this pull request may close these issues.

Add observability coverage to the simulation API
